Final_Project_Ds2020

Global Air Travel Analysis using OpenFlights

Lucas Martins Sorge, Nina De Grandis, Brandon Merrick

Introduction

This data science project explores global air travel patterns using datasets from the OpenFlights database. We analyze airline networks, airport connectivity, geographic coverage, and operational characteristics to reveal insights into global aviation trends. Our goal during this project is to find when and where air traffic is more concentrated. We also want to analyze the structure of global airline route networks, examine geographic coverage and identify underserved regions, and study operational characteristics, including fleet usage and route lengths.

Questions: - How concentrated is global air traffic? - How does airport connectivity vary between developed and developing countries? - What country has the most airports? Where are the countries with the most airports located, and what patterns are there? - Brandon’s - Brandon’s

Data Sources

Data was obtained from OpenFlights: - airlines.dat: Airlines data including operational status. - airports.dat: Airport location and operational details. - routes.dat: Flight routes between airports. - planes.dat: Aircraft types and equipment information. - countries.dat: Country codes and geographic metadata.

Project Objectives

  • Analyze the structure of global airline route networks.
  • Examine geographic coverage and identify underserved regions.
  • Study operational characteristics, including fleet usage and route lengths.

Completed Steps

  • Data loading and cleaning:
    • Handling missing and invalid data.
    • Filtering for active airlines and valid airports.
  • Joining datasets (routes, airlines, airports).

Methodology

  • Our analysis was conducted entirely in R, leveraging a combination of data wrangling, statistical modeling, and visualization techniques.

Data Cleaning

We cleaned and prepared five OpenFlights datasets (airlines, airports, routes, planes, countries) by:

  • Replacing "\\N" with NA and removing rows with missing critical fields.
  • Filtering for relevant records: active airlines, valid airports, and direct routes only.
  • Converting data types (e.g., IDs, coordinates) for consistency.
  • Standardizing column names using clean_names().
  • Joining datasets: routes were linked with airline, airport, country, and plane details.
  • Exploding and summarizing equipment data for route-level analysis.

The result is a cleaned and merged dataset ready for analysis and visualization.

Results

Question 1: How concentrated is global air traffic?

  • Extreme airport-level inequality
    • Lorenz Curve bows sharply below the line of perfect equality, indicating most flights funnel through a few major hubs.
    • Gini coefficient (airports): 0.78
knitr::include_graphics("figures/lorenz_airport.png")

  • Top-percentile shares
    • Top 1% of airports handle ~20% of all flights
    • Top 5% handle ~53% of all flights
    • Top 10% handle ~70% of all flights
  • Leading hubs
    • The busiest airports—Atlanta (ATL), Chicago O’Hare (ORD), Beijing Capital (PEK), etc.—together account for a disproportionately large share of global traffic.
  • Route-level distribution
    • Lorenz Curve for routes lies closer to the equality line, showing a more even spread across connections.
    • Gini coefficient (routes): 0.31
knitr::include_graphics("figures/lorenz_route.png")

  • Key routes
    • Top connections (e.g., ORD → ATL, JFK → LHR) are busiest but represent a smaller overall share compared to top airports.
  • Conclusion
    • Global aviation has a dual structure: a small number of dominant hubs manage the bulk of air traffic, while a wide range of routes ensures broad global connectivity and operational resilience.

Question 2: How does airport connectivity vary between developed and developing countries?

  • Stronger connectivity in developed countries
    • Airports in developed countries show significantly higher average connectivity than those in developing countries.
    • Average connectivity (number of outgoing routes per airport) is roughly twice as high in developed nations.
  • Statistical evidence
    • Welch Two Sample t-test: p = 0.0079
    • Mann-Whitney U test: p = 0.0228
    • Both confirm that the difference is statistically significant.
  • Top connected countries
    • Most countries with the highest average connectivity (e.g., United States, Germany, France) are developed.
    • Some developing countries like the UAE and Singapore stand out as exceptions due to geographic or economic advantages.
  • Visual insights
    • A global map of airports shows large hubs (colored by development status) are clustered in North America, Europe, and East Asia.
    • Airports in developing countries tend to be more regionally focused with lower integration into global flight networks.
knitr::include_graphics("figures/global_connectivity_map.png")

  • Conclusion
    • Global airport connectivity reflects broader economic inequalities.
    • Developed countries are far more integrated into the air transportation network, both in infrastructure and route diversity.

Question 3: What country has the most airports? Where are the countries with the most airports located, and what patterns are there?

Top 25 Countries by Number of Airports

knitr::include_graphics("figures/airports.png")

  • This bar chart shows the top 25 countries by number of airports, and how many airports they have. I chose to only shows the top 25 because any more countries made the graph very hard to read.
  • There are 235 countries in the data set, and everything below the top 25 has only 45 or less airports.
  • The United States takes a staggering lead, with 1251 airports. The next closest is Canada, with 380.
  • Only the top 10 countries have 100 or more airports. - These top 10 countries are among the most populated countries in the world, which would make sense why they have the most airports.
knitr::include_graphics("figures/countries.png")

  • This map shows where the airports for the top 25 countries by number of airports are located.
  • I notice several trends when looking at this map. First, many of the airports are along the coast. This makes sense as these may be more populated areas, especially when accounting for tourism and economic activity.
  • When talking about population, this may be why some areas are blank and without an airport for miles. Australia for example, has most of their population centered in big cities and along the coasts. An airport in the middle of Australia isn’t very necessary. This is similar to Russia, where there is a seeming lack of airports for the size of the country. Russia’s climate and population distribution accounts for this.
  • Another noticeable trend is that Africa is completely left out of the top 25, except for one country, being South Africa. Most every other region of the world has a country with many airports, making travel very accessible.

Question 4: What airports have the most unique flights to and from them

  • Majority are in the US
    • The highest by far is ATL, but 4 of the top 10 are also in the US (ORD, LAX, DFW, and JFK)
  • China has a few big ones
    • The 3rd largest, PEK, as well as PVG are in China
    • Other Asian countries with large airports are Singapore and South Korea
  • Remaining Large Airports are all in Europe
    • UK, France, and Germany are the main countries with large airports
    • Surrounding Western European countries like Spain and the Netherlands also have some large airports.
knitr::include_graphics("figures/numOfFlights.png")

Question 5: What Brand of Plane is Most Popular

  • Plane Type is concentrated
    • Majority are Boeing, Airbus, Douglas, or Embraer
    • Some only have one of that plane
  • Only 246 planes so pretty small sample size
knitr::include_graphics("figures/planeCount.png")

Conclusion